In this notebook example, we'll take a look at Datumaro's data exploration Python API. Specifically, we provide example code for exploring the MS-COCO 2017 dataset with image queries and text queries. Please prepare the COCO 2017 validation dataset, or download it by following the COCO Prerequisite link.
# Copyright (C) 2022 Intel Corporation
#
# SPDX-License-Identifier: MIT
import datumaro as dm
from datumaro.components.searcher import Searcher
from datumaro.components.visualizer import Visualizer
Data exploration requires a hash for each item in the dataset. To compute the hashes, set save_hash to True; the default value is False.
dataset = dm.Dataset.import_from("coco_dataset", format='coco_instances')
dataset
WARNING:root:File '/media/hdd2/datumaro/coco_dataset/annotations/panoptic_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File '/media/hdd2/datumaro/coco_dataset/annotations/person_keypoints_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
WARNING:root:File '/media/hdd2/datumaro/coco_dataset/annotations/captions_val2017.json' was skipped, could't match this file with any of these tasks: coco_instances
Dataset
    size=5000
    source_path=/media/hdd2/datumaro/coco_dataset
    media_type=<class 'datumaro.components.media.Image'>
    annotated_items_count=4952
    annotations_count=78647
subsets
    val2017: # of items=5000, # of annotated items=4952, # of annotations=78647, annotation types=['bbox', 'mask', 'polygon']
infos
categories
    label: ['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']
Create a Searcher with the dataset, which serves as the database to search.
searcher = Searcher(dataset)
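To give some intuition for why per-item hashes make similarity search possible, here is a minimal, self-contained sketch of top-k retrieval over binary hash codes ranked by Hamming distance. It is an illustration under assumptions, not Datumaro's actual implementation: the `hamming_distance` and `hamming_topk` helpers, the 8-bit toy codes, and the item names are all hypothetical.

```python
from typing import Dict, List, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where two integer-encoded hash codes differ."""
    return bin(a ^ b).count("1")


def hamming_topk(query: int, database: Dict[str, int], topk: int) -> List[Tuple[str, int]]:
    """Return the topk (item_id, distance) pairs closest to the query hash."""
    scored = [(item_id, hamming_distance(query, code)) for item_id, code in database.items()]
    # list.sort() is stable, so ties keep the database insertion order.
    scored.sort(key=lambda pair: pair[1])
    return scored[:topk]


# Hypothetical 8-bit hash codes; real hashes would be much longer.
toy_database = {
    "img_a": 0b10110010,
    "img_b": 0b10110011,
    "img_c": 0b01001101,
    "img_d": 0b10100010,
}
toy_results = hamming_topk(0b10110010, toy_database, topk=2)
print(toy_results)
```

Items whose codes differ from the query in fewer bits rank first, which is the general idea behind returning the most similar entries for a query.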
Image query
Pick one item from the dataset to use as the query; the search returns the items most similar to it.
# Use the item at index 50 as the query and stop iterating once it is found.
for i, item in enumerate(dataset):
    if i == 50:
        query = item
        break
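As an alternative to looping with a counter, the same pick can be sketched with `itertools.islice`, which works on any iterable. The `nth_item` helper is a hypothetical name, and a plain list stands in for the dataset so the snippet runs on its own.

```python
from itertools import islice
from typing import Iterable, Optional, TypeVar

T = TypeVar("T")


def nth_item(items: Iterable[T], n: int) -> Optional[T]:
    """Return the item at zero-based position n, or None if the iterable is shorter."""
    return next(islice(items, n, None), None)


# A list of strings stands in for the dataset items.
stand_in_items = [f"item_{i}" for i in range(100)]
picked = nth_item(stand_in_items, 50)
print(picked)
```

With the real dataset this would read `query = nth_item(dataset, 50)`, since the dataset is iterable in the same way.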
Use the Visualizer to check which item was picked as the query.
visualizer = Visualizer(dataset, figsize=(12, 12), alpha=0)
fig = visualizer.vis_one_sample(query.id, "val2017")
fig.show()
topk_list = searcher.search_topk(query, topk=15)
subset_list = []
id_list = []
for result in topk_list:
    subset_list.append(result.subset)
    id_list.append(result.id)
fig = visualizer.vis_gallery(id_list[:12], subset_list[:12], (None, None))
fig.show()
Text query
Use text as the query to find the dataset items that best match it. The query can be a sentence or a single word.
topk_list = searcher.search_topk('elephant', topk=15)
subset_list = []
id_list = []
for result in topk_list:
    subset_list.append(result.subset)
    id_list.append(result.id)
fig = visualizer.vis_gallery(id_list[:12], subset_list[:12], (None, None))
fig.show()
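The image query and the text query repeat the same loop to gather ids and subsets for the gallery. A small helper could factor that out; the sketch below is one possible shape, where `split_ids_and_subsets` is a hypothetical name and `SimpleNamespace` objects stand in for search results, assuming only that each result exposes `id` and `subset` as in the loops above.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def split_ids_and_subsets(results: Iterable, limit: int = 12) -> Tuple[List[str], List[str]]:
    """Collect the id and subset of the first `limit` results into two parallel lists."""
    id_list, subset_list = [], []
    for result in list(results)[:limit]:
        id_list.append(result.id)
        subset_list.append(result.subset)
    return id_list, subset_list


# Stand-in objects exposing only the .id and .subset attributes.
fake_results = [SimpleNamespace(id=f"{i:012d}", subset="val2017") for i in range(15)]
ids, subsets = split_ids_and_subsets(fake_results)
print(len(ids), ids[0], subsets[0])
```

The two returned lists can then be passed to `visualizer.vis_gallery` in place of `id_list[:12]` and `subset_list[:12]`.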